On a Model for Integrated Information
In this paper we give a thorough presentation of a model proposed by Tononi
et al. for \emph{integrated information}, i.e. how much information is
generated in a system transitioning from one state to the next through the
causal interaction of its parts, \emph{above and beyond} the information
given by the sum of its parts. We also provide a more general formulation of
the model, independent of the time chosen for the analysis and of the
uniformity of the probability distribution at the initial time instant.
Finally, we prove that integrated information is null for disconnected systems.
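The "whole minus sum of parts" idea behind the abstract's final claim can be illustrated with a toy computation. The sketch below (all names hypothetical) compares the mutual information between consecutive states of a whole two-node system with the sum over its isolated parts, under a uniform prior at the initial time; it is a deliberate simplification of the Tononi et al. model (no minimum-information partition is searched), but for a disconnected system the difference comes out null, consistent with the paper's result.

```python
import itertools
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Two independent (disconnected) binary nodes: each node copies its own
# previous state with probability 0.9 and flips it with probability 0.1.
def node_transition(prev, cur):
    return 0.9 if prev == cur else 0.1

states = list(itertools.product([0, 1], repeat=2))
prior = 1.0 / len(states)  # uniform prior over system states at t-1

# Joint distribution over (state at t-1, state at t) for the whole system.
joint_whole = {}
for s in states:
    for t in states:
        joint_whole[(s, t)] = (prior * node_transition(s[0], t[0])
                                     * node_transition(s[1], t[1]))
info_whole = mutual_information(joint_whole)

# Information generated by each part considered in isolation.
info_parts = 0.0
for i in range(2):
    joint_part = {(s, t): 0.5 * node_transition(s, t)
                  for s in (0, 1) for t in (0, 1)}
    info_parts += mutual_information(joint_part)

phi = info_whole - info_parts  # ≈ 0 for a disconnected system
```

Because the two nodes neither interact nor share a correlated prior, the whole system generates exactly as much information as its parts do separately, so this simplified `phi` vanishes.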
Differentially Private Graph Learning via Sensitivity-Bounded Personalized PageRank
Personalized PageRank (PPR) is a fundamental tool in unsupervised learning of
graph representations such as node ranking, labeling, and graph embedding.
However, while data privacy is one of the most important recent concerns,
existing PPR algorithms are not designed to protect user privacy. PPR is highly
sensitive to the input graph edges: the difference of only one edge may cause a
big change in the PPR vector, potentially leaking private user data.
In this work, we propose an algorithm which outputs an approximate PPR and
has provably bounded sensitivity to input edges. In addition, we prove that our
algorithm achieves similar accuracy to non-private algorithms when the input
graph has large degrees. Our sensitivity-bounded PPR directly implies private
algorithms for several graph-learning tools, such as differentially private
(DP) PPR ranking, DP node classification, and DP node embedding. To complement
our theoretical analysis, we also empirically verify the practical performance
of our algorithms.
Differentially Private Continual Releases of Streaming Frequency Moment Estimations
The streaming model of computation is a popular approach for working with large-scale data. In this setting, there is a stream of items and the goal is to compute the desired quantities (usually data statistics) while making a single pass through the stream and using as little space as possible.
Motivated by the importance of data privacy, we develop differentially private streaming algorithms under the continual release setting, where the union of outputs of the algorithm at every timestamp must be differentially private. Specifically, we study the fundamental ℓ_p (p ∈ [0, +∞)) frequency moment estimation problem under this setting, and give an ε-DP algorithm that achieves (1+η)-relative approximation (∀ η ∈ (0,1)) with polylog(Tn) additive error and uses polylog(Tn) · max(1, n^{1-2/p}) space, where T is the length of the stream and n is the size of the universe of elements. Our space is near optimal up to poly-logarithmic factors even in the non-private setting.
To obtain our results, we first reduce several primitives under the differentially private continual release model, such as counting distinct elements, heavy hitters, and counting low-frequency elements, to simpler counting/summing problems in the same setting. Based on these primitives, we develop a differentially private continual release level set estimation approach to address the ℓ_p frequency moment estimation problem.
We also provide a simple extension of our results to the harder sliding window model, where the statistics must be maintained over the past W data items.
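The counting/summing primitive that the reductions above target has a classic continual-release solution: the binary-tree mechanism. The sketch below (hypothetical names; this is the standard mechanism, not the paper's ℓ_p algorithm) releases a noisy running count at every timestamp. Each prefix sum is assembled from O(log T) dyadic-interval sums, each perturbed once; a single stream item lies in at most L ≈ log₂ T such intervals, so Laplace noise of scale L/ε per interval makes the entire output sequence ε-DP, with only polylogarithmic additive error per release.

```python
import random

def dp_running_counts(bits, eps, seed=0):
    """Binary-tree mechanism for eps-DP continual counting.

    Each prefix sum [0, t) is decomposed into dyadic intervals; every
    interval's sum is perturbed once with Laplace noise and cached, so
    each stream item influences at most L released values.
    """
    rng = random.Random(seed)
    T = len(bits)
    L = max(1, T.bit_length())  # max dyadic intervals touching one item
    lap = lambda b: rng.expovariate(1 / b) - rng.expovariate(1 / b)

    noisy = {}  # cached noisy sum for each dyadic interval [lo, hi)
    def interval_sum(lo, hi):
        if (lo, hi) not in noisy:
            noisy[(lo, hi)] = sum(bits[lo:hi]) + lap(L / eps)
        return noisy[(lo, hi)]

    out = []
    for t in range(1, T + 1):
        total, lo = 0.0, 0
        for k in reversed(range(L)):  # binary decomposition of [0, t)
            if (t >> k) & 1:
                total += interval_sum(lo, lo + (1 << k))
                lo += 1 << k
        out.append(total)
    return out

stream = [1, 0, 1, 1, 0, 1, 1, 0]
noisy_counts = dp_running_counts(stream, eps=1.0)
```

Releasing a fresh noisy count at every step naively would need noise scaling with T; the dyadic decomposition is what brings both the noise and the per-item influence down to polylog(T).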
Bisect and Conquer: Hierarchical Clustering via Max-Uncut Bisection
Hierarchical Clustering is an unsupervised data analysis method that has
been widely used for decades. Despite its popularity, it long lacked an
analytical foundation; to address this, Dasgupta recently introduced an
optimization viewpoint of hierarchical clustering with pairwise similarity
information, which spurred a line of work shedding light on old algorithms
(e.g., Average-Linkage) as well as designing new ones. Here, for the
maximization dual of Dasgupta's objective (introduced by Moseley-Wang), we
present polynomial-time 0.4246-approximation algorithms that use Max-Uncut
Bisection as a subroutine. The previous best worst-case approximation factor
achievable in polynomial time was 0.336, improving only slightly over
Average-Linkage, which achieves 1/3. Finally, we complement our positive
results by proving APX-hardness (even for 0-1 similarities) under the Small
Set Expansion hypothesis.
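The top-down scheme the title alludes to can be sketched as follows: repeatedly bisect the current node set so as to keep as much similarity weight uncut (within the two sides) as possible, then recurse on each side. In this illustration (all names hypothetical) the Max-Uncut Bisection subroutine is solved by brute force for clarity; the paper of course plugs in a polynomial-time approximation algorithm instead.

```python
import itertools

def max_uncut_bisection(nodes, sim):
    """Brute-force Max-Uncut Bisection: split `nodes` into two
    (near-)equal halves maximizing the similarity weight NOT cut.
    Exponential time; a stand-in for the approximation subroutine."""
    n = len(nodes)
    best, best_val = None, -1.0
    for left in itertools.combinations(nodes, n // 2):
        left = set(left)
        right = set(nodes) - left
        uncut = sum(sim.get((u, v), 0.0)
                    for side in (left, right)
                    for u, v in itertools.combinations(sorted(side), 2))
        if uncut > best_val:
            best_val, best = uncut, (sorted(left), sorted(right))
    return best

def bisect_and_conquer(nodes, sim):
    """Top-down hierarchy: bisect maximizing uncut weight, then recurse."""
    if len(nodes) <= 1:
        return nodes[0] if nodes else None
    left, right = max_uncut_bisection(nodes, sim)
    return (bisect_and_conquer(left, sim), bisect_and_conquer(right, sim))

# Two tightly similar pairs {0,1} and {2,3}, weak cross similarities;
# keys are (u, v) with u < v.
sim = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): 0.1, (1, 3): 0.1}
tree = bisect_and_conquer([0, 1, 2, 3], sim)
print(tree)  # ((0, 1), (2, 3))
```

Maximizing uncut weight keeps similar pairs on the same side of the top split, so each pair is separated only deep in the tree, which is what the Moseley-Wang objective rewards.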
Measuring Re-identification Risk
Compact user representations (such as embeddings) form the backbone of
personalization services. In this work, we present a new theoretical framework
to measure re-identification risk in such user representations. Our framework,
based on hypothesis testing, formally bounds the probability that an attacker
may be able to obtain the identity of a user from their representation. As an
application, we show how our framework is general enough to model important
real-world applications such as Chrome's Topics API for interest-based
advertising. We complement our theoretical bounds by showing provably good
attack algorithms for re-identification that we use to estimate the
re-identification risk in the Topics API. We believe this work provides a
rigorous and interpretable notion of re-identification risk and a framework to
measure it that can be used to inform real-world applications.
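A simple empirical flavor of re-identification risk can be had with a nearest-neighbour matching attack on a toy embedding model. The sketch below (all names and parameters hypothetical; this is a generic illustration, not the paper's hypothesis-testing framework or its Topics API analysis) measures how often an attacker who sees a noisy copy of one user's embedding can recover the identity by nearest neighbour over all users.

```python
import random

def reidentification_rate(n_users, dim, noise, trials=200, seed=0):
    """Empirical re-identification risk for a toy embedding model:
    user embeddings are standard Gaussian vectors; the attacker sees
    one user's embedding plus Gaussian noise and guesses the identity
    by (squared) nearest neighbour over all users."""
    rng = random.Random(seed)
    users = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_users)]
    hits = 0
    for _ in range(trials):
        target = rng.randrange(n_users)
        obs = [x + rng.gauss(0, noise) for x in users[target]]
        guess = min(range(n_users),
                    key=lambda u: sum((a - b) ** 2
                                      for a, b in zip(users[u], obs)))
        hits += guess == target
    return hits / trials

# Heavy noise keeps the attacker near chance; light noise makes the
# representation almost uniquely identifying.
low = reidentification_rate(n_users=50, dim=8, noise=5.0)
high = reidentification_rate(n_users=50, dim=8, noise=0.1)
```

The gap between `low` and `high` is the kind of quantity a formal framework pins down: how much noise (or how coarse a representation) is needed before the attacker's success probability provably approaches chance.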